gradient propagation




e2065cb56f5533494522c46a72f1dfb0-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their insightful remarks and comments, which helped to considerably improve our manuscript. We address the most important ones in detail below. Before doing so, we highlight a comment from R3 in order to make an important clarification about the scope of our contribution: "It is well known that an attention mechanism would reduce gradient vanishing. It feels trivial to me as there is a direct connection for gradients to pass."

We are in complete agreement and recognize that the very mechanism of (self-)attention was designed to improve gradient propagation over long sequences, and that sparsity is a good way to keep complexity costs low. Much like work from the '90s established formal results for gradient exploding/vanishing in deep/recurrent networks, we believe it is crucial to establish similar theoretical tools for attention mechanisms, as these methods are under intense development and scalability and complexity are important issues.

The proposed relevancy mechanism and accompanying experiments, building on established work, are meant to illustrate how our theorems can be concretely exploited. We chose simple tasks for their ease of interpretation and their variety of computational demands (memorization, prediction, RL, etc.). As is clearly indicated in the text, it is not our goal to propose this method "as is" in a race for state-of-the-art. We recognize that reviewers may have based their evaluation as they would have in a method paper, and we kindly invite them to reconsider the value of our experiments in the broader context of our theoretical contributions. We also thank the reviewers for their additional minor comments not explicitly addressed here and agree to implement them.

R1: Q: "The authors didn't spell out the relation between κ and d: higher κ tends to have smaller d."
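The rebuttal's core point (a recurrent path multiplies T Jacobians, while attention gives gradients a one-hop path back to any position) can be illustrated numerically. The sketch below is a minimal numpy illustration under our own assumptions (an arbitrary contraction matrix and a uniform attention weight); it is not the paper's relevancy mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 16

# Recurrent path: the gradient w.r.t. the first hidden state is a product of
# T Jacobians. With spectral norm < 1 this product vanishes geometrically.
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)   # force spectral norm 0.5 (a contraction)
grad = np.eye(d)
for _ in range(T):
    grad = W.T @ grad
recurrent_norm = np.linalg.norm(grad)

# Attention-style path: a direct connection from the last step back to the
# first, so the gradient passes through a single attention weight instead.
attn_weight = 1.0 / T             # e.g. uniform attention over T positions
direct_norm = np.linalg.norm(attn_weight * np.eye(d))

print(f"recurrent: {recurrent_norm:.2e}  direct: {direct_norm:.2e}")
```

Even with uniform (maximally diluted) attention, the direct path keeps a polynomially small gradient (1/T), while the recurrent product decays exponentially in T.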


Interpolation Technique to Speed Up Gradients Propagation in Neural ODEs

Neural Information Processing Systems

We propose a simple interpolation-based method for the efficient approximation of gradients in neural ODE models. We compare it with the reverse dynamic method (known in the literature as the "adjoint method") for training neural ODEs on classification, density estimation, and inference approximation tasks. We also propose a theoretical justification of our approach using the logarithmic-norm formalism. As a result, our method allows faster model training than the reverse dynamic method, which we confirm and validate through extensive numerical experiments on several standard benchmarks.
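The core idea can be sketched as follows: if the forward trajectory z(t) is sampled at Chebyshev nodes, a polynomial interpolant lets the backward pass query z(t) anywhere without re-integrating the ODE. This is a minimal numpy sketch, not the authors' implementation; a toy ODE with a known closed-form solution stands in for the solver's evaluations.

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# Toy ODE z'(t) = -z(t), z(0) = 1, with exact solution z(t) = exp(-t).
# In the paper's setting the ODE solver would supply z at the Chebyshev
# nodes; here the exact solution stands in for those solver evaluations.
z = lambda t: np.exp(-t)

# Degree-8 Chebyshev interpolant of the trajectory on [0, 1]
# (Chebyshev.interpolate samples z at Chebyshev points internally).
z_hat = Chebyshev.interpolate(z, 8, domain=[0.0, 1.0])

# The backward pass can now query z_hat(t) at arbitrary times instead of
# re-integrating the ODE, which is the source of the claimed speed-up.
t = np.linspace(0.0, 1.0, 101)
max_err = np.max(np.abs(z_hat(t) - z(t)))
print(max_err)
```

For smooth trajectories the interpolation error decays spectrally in the polynomial degree, so a low-degree interpolant can already be far more accurate than typical solver tolerances.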



Low-rank surrogate modeling and stochastic zero-order optimization for training of neural networks with black-box layers

Chertkov, Andrei, Basharin, Artem, Saygin, Mikhail, Frolov, Evgeny, Straupe, Stanislav, Oseledets, Ivan

arXiv.org Artificial Intelligence

The growing demand for energy-efficient, high-performance AI systems has led to increased attention on alternative computing platforms (e.g., photonic, neuromorphic) due to their potential to accelerate learning and inference. However, integrating such physical components into deep learning pipelines remains challenging, as physical devices often offer limited expressiveness, and their non-differentiable nature renders on-device backpropagation difficult or infeasible. This motivates the development of hybrid architectures that combine digital neural networks with reconfigurable physical layers, which effectively behave as black boxes. In this work, we present a framework for the end-to-end training of such hybrid networks. This framework integrates stochastic zeroth-order optimization for updating the physical layer's internal parameters with a dynamic low-rank surrogate model that enables gradient propagation through the physical layer. A key component of our approach is the implicit projector-splitting integrator algorithm, which updates the lightweight surrogate model after each forward pass with minimal hardware queries, thereby avoiding costly full matrix reconstruction. We demonstrate our method across diverse deep learning tasks, including: computer vision, audio classification, and language modeling. Notably, across all modalities, the proposed approach achieves near-digital baseline accuracy and consistently enables effective end-to-end training of hybrid models incorporating various non-differentiable physical components (spatial light modulators, microring resonators, and Mach-Zehnder interferometers). This work bridges hardware-aware deep learning and gradient-free optimization, thereby offering a practical pathway for integrating non-differentiable physical components into scalable, end-to-end trainable AI systems.
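The stochastic zeroth-order part of such a pipeline can be sketched with a simultaneous-perturbation (SPSA) gradient estimator, which needs only two forward queries of the black box per estimate, independent of the parameter dimension. This is a generic hedged illustration (a toy quadratic plays the black-box layer), not the paper's exact estimator or its low-rank surrogate.

```python
import numpy as np

def spsa_grad(f, theta, eps=1e-2, rng=None):
    """SPSA estimate of grad f(theta) using two black-box queries.

    f is treated as a black box (e.g. a physical layer we can only
    evaluate), so no analytic gradient is ever required.
    """
    if rng is None:
        rng = np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher direction
    return (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps) * delta

# Toy "black-box layer": a quadratic whose true gradient is 2 * theta.
f = lambda th: np.sum(th ** 2)

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5])

# A single estimate is noisy; averaging converges to the true gradient.
est = np.mean([spsa_grad(f, theta, rng=rng) for _ in range(2000)], axis=0)
print(est)   # approaches the true gradient [2.0, -4.0, 1.0]
```

In practice one would use each noisy estimate directly inside an SGD-style update rather than averaging, trading variance for far fewer hardware queries.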



Review for NeurIPS paper: Interpolation Technique to Speed Up Gradients Propagation in Neural ODEs

Neural Information Processing Systems

Weaknesses: I take issue with two aspects of this submission that lead me to recommend rejection at this point. The submission claims that the evaluations of z(t) at the Chebyshev grid points can be obtained without additional cost, e.g., at line 108. While this is true in a general sense, the claim ignores several aspects of numerical theory, both in the text and in the code. Runge-Kutta methods only guarantee high-order approximations at their own grid points. If high-order approximations are sought at pre-defined grid points, there are two solutions: (a) the solver is forced to include the pre-defined grid points in its otherwise adaptive mesh, or (b) a particular Runge-Kutta formula with a smooth interpolant (a continuous extension) has to be chosen.
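The reviewer's option (b) corresponds to the dense output provided by modern adaptive solvers: a continuous extension of the Runge-Kutta step that can be queried off the solver's own mesh. A hedged sketch using SciPy (an illustration of the general point, not the submission's code) evaluates such an interpolant at Chebyshev nodes:

```python
import numpy as np
from scipy.integrate import solve_ivp

# An RK solver is only high-order at its own accepted steps. To get z(t)
# at pre-defined Chebyshev nodes, either force the nodes into the mesh
# (via t_eval) or use the solver's continuous extension (dense_output),
# i.e. a smooth-interpolant Runge-Kutta formula.
def rhs(t, z):
    return -z   # toy ODE with exact solution exp(-t)

# Chebyshev nodes (second kind) on [0, 1].
k = np.arange(9)
cheb = 0.5 * (1.0 - np.cos(np.pi * k / 8))

sol = solve_ivp(rhs, (0.0, 1.0), [1.0], method="RK45",
                dense_output=True, rtol=1e-8, atol=1e-10)

# Query the dense-output interpolant at the Chebyshev nodes.
z_cheb = sol.sol(cheb)[0]
err = np.max(np.abs(z_cheb - np.exp(-cheb)))
print(err)
```

Note that the accuracy between accepted steps is governed by the order of the interpolant (quartic for RK45's dense output), not by the step-error tolerance alone, which is exactly the subtlety the review says the submission glosses over.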


Review for NeurIPS paper: Interpolation Technique to Speed Up Gradients Propagation in Neural ODEs

Neural Information Processing Systems

The four reviewers, all of whom are domain experts, agree that this is a good paper that delivers a delicate but useful methodological contribution to the growing area of NODEs. It should thus be accepted. However, the reviewers have also raised several suggestions and requests for improvement. Please make sure to address them as thoroughly as possible to ensure this paper reaches its audience.


Contextual Gradient Flow Modeling for Large Language Model Generalization in Multi-Scale Feature Spaces

Quillington, Daphne, Fairbrother, Kingsley, Tattershall, Xavier, Kabakum, Irin

arXiv.org Artificial Intelligence

Optimization methodologies for training large-scale neural architectures often rely on uniform gradient propagation mechanisms that fail to align with hierarchical linguistic structures, limiting their capacity to generalize across diverse language distributions. A structured gradient refinement framework was introduced to incorporate multi-scale contextual adjustments, improving parameter adaptation through dynamic weighting strategies that enhanced representation coherence. Empirical evaluations demonstrated that structured propagation mechanisms contributed to reductions in gradient oscillations, resulting in more stable training dynamics and improved optimization efficiency. The comparative performance assessment indicated that models incorporating hierarchical propagation strategies exhibited greater robustness in long-range dependency retention and cross-domain adaptation. The hierarchical adjustment of weight updates provided an alternative to conventional backpropagation, reducing sensitivity to initialization conditions while improving overall convergence efficiency. The experimental results confirmed that structured gradient propagation influenced representation learning trajectories, aligning parameter updates with broader linguistic dependencies rather than isolated token-level relationships. Statistical evaluations indicated that structured optimization strategies mitigated overfitting while preserving adaptability across heterogeneous text distributions. The findings established that structured gradient propagation provided an empirically validated framework for refining hierarchical representation learning, supporting more effective integration of linguistic dependencies into optimization dynamics.